Information Extraction from Web Product Catalogues
Abstract
In this paper we present preliminary results for information extraction (IE) performed over a set of HTML documents using Hidden Markov Models (HMMs). In our experiments, we restrict ourselves to the domain of bike products sold on the Internet. The information to be extracted consists of bike model attributes and details regarding the company’s offer. We experiment with three approaches utilising HMMs and present results in terms of precision and recall.
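The abstract does not say how the HMMs are configured, so the following is only a minimal sketch of the general technique: an HMM whose hidden states are extraction labels, decoded with the Viterbi algorithm over the tokens of a page. The label set (`other`, `name`, `price`), the coarse observation classes, and all probabilities are hypothetical values chosen by hand for illustration; they are not taken from the paper, where such parameters would normally be estimated from annotated documents.

```python
import math

# Hypothetical label set; the paper's actual HMM states are not given in the abstract.
STATES = ["other", "name", "price"]

# Toy parameters picked by hand for illustration, not learned from data.
START = {"other": 0.6, "name": 0.3, "price": 0.1}
TRANS = {
    "other": {"other": 0.6, "name": 0.3, "price": 0.1},
    "name": {"other": 0.2, "name": 0.5, "price": 0.3},
    "price": {"other": 0.7, "name": 0.1, "price": 0.2},
}
EMIT = {
    "other": {"word": 0.8, "capitalised": 0.15, "number": 0.05},
    "name": {"word": 0.2, "capitalised": 0.7, "number": 0.1},
    "price": {"word": 0.05, "capitalised": 0.05, "number": 0.9},
}


def token_class(token):
    """Map a raw token to one of three coarse observation symbols."""
    if any(ch.isdigit() for ch in token):
        return "number"
    if token[:1].isupper():
        return "capitalised"
    return "word"


def viterbi(tokens):
    """Return the most probable label sequence for the tokens under the toy HMM."""
    obs = [token_class(t) for t in tokens]
    if not obs:
        return []
    # scores[s] holds the best log-probability of any path that ends in state s.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for o in obs[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][o])
            )
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)
    # Follow the back-pointers from the best final state to recover the path.
    path = [max(STATES, key=lambda s: scores[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    tokens = ["the", "Trek", "Marlin", "costs", "499"]
    for token, label in zip(tokens, viterbi(tokens)):
        print(token, label)
```

Precision and recall, the measures the abstract reports, would then be computed by comparing the labels produced this way against hand-annotated labels for the same tokens.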
Similar resources
Multimedia Information Extraction in Ontology-based Semantic Annotation of Product Catalogues
The demand for efficient methods for extracting knowledge from multimedia content has led to a growing research community investigating the convergence of multimedia and knowledge technologies. In this paper we describe a methodology for extracting multimedia information from product catalogues empowered by the synergetic use and extension of a domain ontology. The methodology was implemented ...
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Ontology-based Product Catalogues: An Example Implementation
Electronic Product Catalogues are the basis for offering and selling products in online market places. To be efficient, these catalogues have to provide a semantically precise description of product features to allow for effective matchmaking of products and customer requests. At the same time, the description has to follow a common terminology that allows the integration with the catalogues of...
Multimedia information extraction from HTML product catalogues
We describe a demo application of information extraction from company websites, focusing on bicycle product offers. A statistical approach (Hidden Markov Models) is used in combination with different ways of image classification, including latent semantic analysis of image collections. Ontological knowledge is used to group the extracted items into structured objects. The results are stored in ...
State of the Art and Classification of Electronic Product Catalogues on CD-ROM
With the expansion of services on the World Wide Web (WWW) and the distribution of information on CD-ROM, modern electronic support for the advertising and sale of goods has become a key factor in the marketing strategy of many companies. Information systems that focus on the multimedia presentation of products or services, with functions that allow searching, selection, ...